In [ ]:
%load_ext watermark
%watermark -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,scikit-learn
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
We first load the text data from the datasets directory, which should be located in your notebooks directory and which we created by running the fetch_data.py script from the top level of the GitHub repository.
Furthermore, we perform some simple preprocessing and split the data array into two parts:
text: a list of strings, where each string contains the contents of one message
y: our SPAM vs HAM labels stored in binary; a 1 represents a spam message, and a 0 represents a ham (non-spam) message.
In [ ]:
import os
with open(os.path.join("datasets", "smsspam", "SMSSpamCollection")) as f:
    # each line has the form "<label>\t<message>"
    lines = [line.strip().split("\t") for line in f.readlines()]
text = [x[1] for x in lines]
y = [int(x[0] == "spam") for x in lines]
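To make the parsing step concrete, here is a minimal sketch that applies the same logic to a few in-memory lines. The sample messages are made up for illustration; only the tab-separated "label, message" layout is taken from the loading code above.

```python
# Hypothetical sample lines in the same "label<TAB>message" layout
sample_lines = [
    "ham\tSee you at lunch?\n",
    "spam\tWIN a free prize now!\n",
    "ham\tRunning late, sorry.\n",
]

# Split each line into [label, message]
rows = [line.strip().split("\t") for line in sample_lines]

sample_text = [row[1] for row in rows]               # raw message strings
sample_y = [int(row[0] == "spam") for row in rows]   # 1 = spam, 0 = ham

print(sample_text)
print(sample_y)
```

Each entry of sample_text is a plain string, and sample_y holds one integer label per message.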
In [ ]:
text[:10]
In [ ]:
y[:10]
In [ ]:
print('Number of ham and spam messages:', np.bincount(y))
In [ ]:
type(text)
In [ ]:
type(y)
Next, we split our dataset into two parts, a training set and a test set:
In [ ]:
from sklearn.model_selection import train_test_split
text_train, text_test, y_train, y_test = train_test_split(text, y,
                                                          random_state=42,
                                                          test_size=0.25,
                                                          stratify=y)
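The stratify argument matters here because spam is the minority class. The following sketch, which uses made-up labels (80 ham, 20 spam) and placeholder features rather than the SMS data, suggests how a stratified split keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (0) and 20 spam (1)
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

_, _, lab_train, lab_test = train_test_split(
    features, labels, random_state=42, test_size=0.25, stratify=labels)

# Count how many ham and spam labels ended up in each part
print("train counts:", np.bincount(lab_train))
print("test counts: ", np.bincount(lab_test))
```

Both parts should show the same 4:1 ratio of ham to spam as the full label array.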
Now, we use the CountVectorizer to parse the text data into a bag-of-words model.
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
print('CountVectorizer defaults')
CountVectorizer()
In [ ]:
vectorizer = CountVectorizer()
vectorizer.fit(text_train)
X_train = vectorizer.transform(text_train)
X_test = vectorizer.transform(text_test)
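As a small illustration of what the bag-of-words transformation produces, the sketch below runs a CountVectorizer with default settings on a made-up three-document corpus (not the SMS data). Each column corresponds to one vocabulary token and each row holds the token counts for one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_corpus = ["free prize now", "call me now", "free free call"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each token to its column index; sort tokens by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)
print(toy_counts.toarray())  # dense view is fine for a tiny example
```

The result of fit_transform is a sparse matrix; converting it to a dense array is only practical for tiny examples like this one.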
In [ ]:
print(len(vectorizer.vocabulary_))
In [ ]:
X_train.shape
In [ ]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out()[:20])
In [ ]:
print(vectorizer.get_feature_names_out()[2000:2020])
In [ ]:
print(X_train.shape)
print(X_test.shape)
We can now train a classifier, for instance a logistic regression classifier, which is a fast baseline for text classification tasks:
In [ ]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf
In [ ]:
clf.fit(X_train, y_train)
We can now evaluate the classifier on the test set. Let's first use the built-in score method, which returns the accuracy, i.e. the fraction of correctly classified samples in the test set:
In [ ]:
clf.score(X_test, y_test)
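To illustrate what score computes for a classifier, the sketch below compares it with a manual accuracy calculation. The one-dimensional toy data is made up for this example and unrelated to the SMS dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: a single feature and binary labels
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression().fit(X_toy, y_toy)

# Accuracy by hand: fraction of predictions that match the true labels
manual_accuracy = np.mean(toy_clf.predict(X_toy) == y_toy)

print(toy_clf.score(X_toy, y_toy), manual_accuracy)
```

The two printed numbers should agree, since score is the mean accuracy of predict on the given data.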
We can also compute the score on the training set to see how well we do there:
In [ ]:
clf.score(X_train, y_train)
In [ ]:
def visualize_coefficients(classifier, feature_names, n_top_features=25):
    # get the indices of the most negative and most positive coefficients
    coef = classifier.coef_.ravel()
    positive_coefficients = np.argsort(coef)[-n_top_features:]
    negative_coefficients = np.argsort(coef)[:n_top_features]
    interesting_coefficients = np.hstack([negative_coefficients, positive_coefficients])
    # plot them; the number of bars follows n_top_features instead of a hard-coded 50
    n_bars = 2 * n_top_features
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" for c in coef[interesting_coefficients]]
    plt.bar(np.arange(n_bars), coef[interesting_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, n_bars + 1), feature_names[interesting_coefficients], rotation=60, ha="right")
In [ ]:
visualize_coefficients(clf, vectorizer.get_feature_names_out())
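The selection step inside visualize_coefficients can also be tried without plotting. This is a minimal sketch with made-up coefficients and token names (not values from the fitted model), showing how np.argsort picks the most negative and most positive features.

```python
import numpy as np

# Hypothetical coefficients and the tokens they belong to
toy_coef = np.array([0.5, -2.0, 1.5, -0.1, 3.0])
toy_names = np.array(["call", "home", "free", "ok", "prize"])

n_top = 2
order = np.argsort(toy_coef)              # indices from smallest to largest coefficient

most_negative = toy_names[order[:n_top]]  # tokens with the smallest coefficients
most_positive = toy_names[order[-n_top:]] # tokens with the largest coefficients

print("most negative:", most_negative)
print("most positive:", most_positive)
```

With labels encoded as 1 for spam and 0 for ham, large positive coefficients push a message toward the spam class and large negative ones toward ham.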
In [ ]:
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(text_train)
X_train = vectorizer.transform(text_train)
X_test = vectorizer.transform(text_test)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))
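The min_df=2 setting used above drops every token that appears in fewer than two documents. The following sketch shows the effect on a made-up three-document corpus; the documents are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_corpus = ["free prize now", "call me now", "free call today"]

# Default settings keep every token that matches the tokenizer pattern
vocab_all = sorted(CountVectorizer().fit(toy_corpus).vocabulary_)

# min_df=2 keeps only tokens that occur in at least two documents
vocab_min2 = sorted(CountVectorizer(min_df=2).fit(toy_corpus).vocabulary_)

print(vocab_all)
print(vocab_min2)
```

Tokens that occur in a single document only are removed, which shrinks the feature space and can reduce the influence of very rare words on the model.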
In [ ]:
visualize_coefficients(clf, vectorizer.get_feature_names_out())
In [ ]:
#%load solutions/12A_tfidf.py
Change the parameters min_df and ngram_range of the TfidfVectorizer and CountVectorizer. How does that change the important features?
In [ ]:
#%load solutions/12B_vectorizer_params.py